Approach

Unconstrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Find the combination of algorithm inputs that maximizes accuracy. If there are ties, break them by choosing the point that requires the least data.
  3. Find the costs associated with running the algorithm with those inputs on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes cost and time.
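The four steps above can be sketched end-to-end in R. The candidate grid and the accuracy/time/cost formulas below are invented stand-ins for the fitted BART prediction surfaces, so only the shape of the procedure is real:

```r
## Toy end-to-end sketch of the four unconstrained-optimization steps.
## All formulas are illustrative assumptions, not the fitted models.
candidates <- expand.grid(trainingExamples = c(100, 1000, 10000),
                          covariates = 2:5,
                          cores = 1:4,
                          GBMemory = c(2, 4, 8))
## Step 1: predict accuracy, time, and cost for every candidate
candidates$accuracy <- with(candidates,
                            0.6 + 0.02 * log10(trainingExamples) * covariates / 5)
candidates$seconds  <- with(candidates, trainingExamples / (50 * cores))
candidates$cents    <- with(candidates, seconds * (cores + GBMemory / 2) * 0.01)
## Step 2: accuracy-maximizing inputs, ties broken by least data
tied   <- candidates[candidates$accuracy == max(candidates$accuracy), ]
inputs <- tied[which.min(tied$trainingExamples),
               c("trainingExamples", "covariates")]
## Steps 3-4: fix those inputs, then jointly minimize time and cost
## (each axis scaled by its standard deviation)
fixed <- merge(candidates, inputs)
d     <- sqrt((fixed$seconds / sd(fixed$seconds))^2 +
              (fixed$cents   / sd(fixed$cents))^2)
optimal <- fixed[which.min(d), ]
```

With this toy grid, the fixed inputs land at the largest data point because the toy accuracy formula increases monotonically in both inputs; the real surfaces come from the BART models fit below.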

Data-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the accuracy surface produced above to the amount of data available. The maximizing point will fall in the upper right corner of the subsetted space.
  3. Find the costs associated with running the algorithm at the accuracy-maximizing point on all hardware configurations.
  4. Find the combination of hardware that jointly minimizes time and cost.

Cost-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the space produced above to the amount of time and money that can be spent on modelling.
  3. Working backwards now, find the accuracies that can be produced in the limited time.
  4. Using the subset of accuracy space, find the combination of algorithm inputs that maximizes accuracy.
knitr::opts_chunk$set(cache=TRUE, echo=F, warning=F, error = F, message=F)
knitr::opts_knit$set(root.dir = "/users/scottsfarley/documents")
setwd("/users/scottsfarley/documents")
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
library(akima)
library(ggplot2)
options(java.parameters = "-Xmx1500m")
library(bartMachine)
## Loading required package: rJava
## Loading required package: bartMachineJARs
## Loading required package: car
## Warning: replacing previous import 'lme4::sigma' by 'stats::sigma' when
## loading 'pbkrtest'
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: missForest
## Loading required package: itertools
## Welcome to bartMachine v1.2.3! You have 1.4GB memory available.
## 
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
bartMachine::set_bart_machine_num_cores(3)
## bartMachine now using 3 cores.
library(reshape2)

threshold.time <- 22 ##seconds
threshold.cost <- 30 ##cents
threshold.numTex <- 2500

First, get the training data and fit the model. Perform some skill checks on it.

## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done

Choose a finite number of possible solutions to the model. Ideally, we would evaluate every single combination of predictor variables on [0, Inf). This is obviously intractable. Moreover, I only have data for a subset of that space anyway. So randomly sample the subspace in which I have data to make the problem tractable.

Using that subset of data and the models we fit previously, predict each candidate configuration of algorithm inputs and hardware variables for execution time and SDM accuracy.
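The prediction step looks roughly like this. The real workflow calls predict() on the fitted bartMachine time and accuracy models; a plain lm() stands in here so the sketch is self-contained, and the training data and formula are toy assumptions:

```r
## Sketch of predicting each candidate configuration.
## An lm() stands in for the fitted bartMachine models (toy data/formula).
set.seed(1)
train <- data.frame(trainingExamples = runif(200, 100, 10000),
                    cores = sample(1:8, 200, replace = TRUE))
train$seconds <- with(train, trainingExamples / (40 * cores)) + rnorm(200)
time.model <- lm(seconds ~ trainingExamples + cores, data = train)

## Predict execution time for each candidate configuration
cand <- expand.grid(trainingExamples = c(500, 5000), cores = c(2, 8))
cand$seconds.hat <- predict(time.model, newdata = cand)
```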

Plot the posterior means of the accuracy models against the algorithm inputs that should control accuracy. In this case, these are number of training examples and number of covariates.

The accuracy clearly varies from low (few training examples and few covariates) to very high (many training examples and many covariates). Perhaps more data would be helpful here, but we work with what we have. Our task is to find the combination of inputs that results in the highest-accuracy model. If there’s a tie, choose the combination that needs the least data.

Now, we know the combination of algorithm inputs that results in the highest accuracy. The figure below shows the identified combination on the training-examples and covariates axes. This combination of training examples and covariates can be run on any combination of hardware, some of which might be suboptimal. Thus, at this point, we’ve solved half of our challenge: the algorithm inputs have been optimized; now it’s time to optimize the hardware.

## [1] "Accuracy is maximized at 10000 training examples and 5 predictors."

In theory, the hardware parameters should not affect SDM accuracy. We can test this assumption by plotting the accuracies obtained for this combination of algorithm inputs against the number of CPUs and amount of memory. If the assumption is valid, the plot should show no change in either the horizontal or vertical direction. We see that there is, in fact, some change. This is likely due to the experimental design and the lack of a full factorial setup. The effect is relatively minor, so I note it and move along.

## [1] "Accuracy Range on Hardware:  0.0146282801147767"
## [1] "Accuracy Range from Expectation:  0"
## [1] "------"
## [1] "Fixing accuracy at:  0.808736414727664"

Now, fix the algorithm inputs at the accuracy-maximizing point, effectively fixing the expected model accuracy. An algorithm with these inputs can be run on any combination of hardware. Project how long that specific model would take, and how much it would cost, on each computing type. Plot those out on time vs. cost axes.

The optimal solution is the one that balances time and cost equally during the minimization. We use Euclidean distance on standardized axes: each dimension is divided by its standard deviation so the two are weighted equally. For each candidate combination of hardware, we calculate the distance between it and the origin of these two axes, then take the candidate with the minimum distance as the optimal.
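The distance calculation itself is a one-liner. The (seconds, cents) pairs below are toy candidate points, not the real predicted surface:

```r
## Standardized Euclidean distance to the origin of the time/cost axes.
## Each axis is divided by its standard deviation so neither dominates.
seconds <- c(1489, 2386, 2516, 2119)   # toy candidate run times
cents   <- c(89.5, 645.5, 529.4, 8301.2)  # toy candidate costs
d <- sqrt((seconds / sd(seconds))^2 + (cents / sd(cents))^2)
which.min(d)   # index of the optimal hardware candidate
```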

Our job is complete. We’ve now optimized both the hardware and software dimensions of the problem.

## [1] "------GBM-BRT OPTIMAL--------"
## [1] "Predicted Optimal Accuracy 0.808736414727664 +/- 0"
## [1] "Predicted Optimal Cost (seconds) 1489.01971037258"
## [1] "Predicted Optimal Cost (cents) 89.5198649875996"
## [1] "Cores:  2"
## [1] "Memory: 2"
## [1] "Training Examples: 10000"
## [1] "Covariates: 5"
## [1] "Distance from origin is:  1491.70825033096"

Everything up to this point was done using the mean of the posterior distribution, a choice that simplifies the process but discards information and may lead to over-confidence in the predictions. We can modify our steps to include information from the entire posterior, which may address this issue.

Instead of projecting just the mean time and mean cost for use in the distance minimization, use the entire set of posterior samples. Calculate the distance metric for each sample in the posterior independently. We’re then left with a distribution of distances, from which we can infer the minimum value.
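A sketch of the per-sample calculation. Simulated normal draws stand in for the posterior samples that bartMachine exposes (via bart_machine_get_posterior); the 0.0601 cents-per-second rate is a back-of-envelope assumption for a fixed linear time-to-cost mapping:

```r
## Distance metric computed per posterior sample rather than at the mean.
## rnorm() draws stand in for bartMachine posterior samples (assumption).
set.seed(42)
time.draws <- rnorm(1000, mean = 1489, sd = 120)   # seconds
cost.draws <- time.draws * 0.0601                  # cents, linear in time
d <- sqrt((time.draws / sd(time.draws))^2 +
          (cost.draws / sd(cost.draws))^2)
summary(d)   # a whole distribution of distances, not a single value
```

Because cost is a deterministic linear function of time here, the (time, cost) draws fall on a line, which matches the shape of the posterior plot below.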

The posterior samples fall along a line, since there’s a fixed linear relationship between time and cost.

Now, find the distance metrics for all of those points.

There’s a lot of overlap in this figure, and many points are far from the optimum. We don’t care about those. Take the few closest to the minimum and look at their distributions.

Now, the optimal configuration may be one of the following:

| config | cores | GBMemory | seconds | cost (cents) | distance.mean | distance.sd |
|-------:|------:|---------:|---------:|-------------:|--------------:|------------:|
| 13 | 2 | 2 | 1489.020 | 89.51986 | 1796.690 | 1545.021 |
| 97 | 9 | 2 | 2386.043 | 645.52012 | 2936.133 | 1781.918 |
| 109 | 10 | 2 | 2386.043 | 717.24458 | 2959.525 | 1796.115 |
| 73 | 7 | 2 | 2516.034 | 529.42388 | 3056.553 | 1860.960 |
| 49 | 5 | 2 | 2551.984 | 383.56320 | 3067.172 | 1865.487 |
| 85 | 8 | 2 | 2516.034 | 605.05587 | 3076.325 | 1872.998 |
| 61 | 6 | 2 | 2551.984 | 460.27584 | 3082.043 | 1874.532 |
| 121 | 11 | 2 | 2484.284 | 821.45323 | 3106.939 | 1885.104 |
| 133 | 12 | 2 | 2484.284 | 896.13080 | 3135.908 | 1902.680 |
| 145 | 13 | 2 | 2489.361 | 972.79267 | 3175.095 | 1930.906 |
| 157 | 14 | 2 | 2489.361 | 1047.62288 | 3208.520 | 1951.233 |
| 169 | 15 | 2 | 2489.361 | 1122.45308 | 3244.037 | 1972.832 |
| 181 | 16 | 2 | 2489.361 | 1197.28329 | 3281.579 | 1995.663 |
| 2 | 1 | 4 | 2940.606 | 147.32434 | 3292.283 | 1803.451 |
| 3 | 1 | 5 | 2941.947 | 176.86984 | 3295.195 | 1804.163 |
| 193 | 17 | 2 | 2489.361 | 1272.11349 | 3321.076 | 2019.682 |
| 204 | 17 | 22 | 2118.844 | 8301.24771 | 3362.460 | 2044.850 |
| 25 | 3 | 2 | 2779.937 | 250.69475 | 3371.967 | 2154.571 |
| 37 | 4 | 2 | 2779.937 | 334.25967 | 3382.529 | 2161.319 |
| 216 | 18 | 22 | 2118.844 | 8789.55640 | 3405.662 | 2071.123 |
| 228 | 19 | 22 | 2118.844 | 9277.86509 | 3450.614 | 2098.460 |
| 240 | 20 | 22 | 2118.844 | 9766.17378 | 3497.248 | 2126.820 |
| 252 | 21 | 22 | 2118.844 | 10254.48247 | 3545.499 | 2156.163 |
| 264 | 22 | 22 | 2118.844 | 10742.79116 | 3595.299 | 2186.449 |
| 276 | 23 | 22 | 2118.844 | 11231.09985 | 3646.588 | 2217.639 |

In the results above, you’re actually seeing the trade-off between time and money play out quite nicely. Adding cores costs money but, in the case of GBM-BRTs, reduces time. Here, that trade-off almost exactly evens out.

Cost-Constrained Optimization

There are two main types of constraints on this optimization problem: (1) limited time and/or money and (2) limited data. In the first case, the researcher only has so much time or money (or both) available to spend on modelling. She must optimize her workflow to get the most out of the limited funds available to her. For a single experiment this hardly matters, since only cents and seconds are spent computing the models. However, when we think about global syntheses with many species, these costs add up to significant expenditures.

In this simple example, the researcher has a hard maximum of 10 seconds and 25 cents to be spent on the model.

First, calculate a surface of all the possible configurations, whether or not they meet her threshold.

Now, subset that surface, retaining only the configurations that satisfy her thresholds on time and money. Find the maximum amount of data that can be used, and calculate all the accuracies that could be achieved with it. This down-weights expensive computing types and pushes the solution toward high accuracy.
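The subsetting step can be sketched as follows. The grid and the accuracy/time/cost formulas are invented stand-ins for the BART-predicted surfaces; only the threshold values come from the scenario:

```r
## Cost-constrained sketch: keep only candidates under the time and money
## thresholds, then maximize accuracy in the feasible set (toy formulas).
threshold.time <- 10   # seconds
threshold.cost <- 25   # cents
candidates <- expand.grid(trainingExamples = c(110, 1000, 4000, 10000),
                          covariates = 2:5,
                          cores = c(2, 4, 8))
candidates$accuracy <- with(candidates,
                            0.55 + 0.03 * log10(trainingExamples) * covariates / 5)
candidates$seconds  <- with(candidates, trainingExamples / (60 * cores))
candidates$cents    <- with(candidates, seconds * cores * 0.5)
feasible <- subset(candidates, seconds <= threshold.time & cents <= threshold.cost)
best <- feasible[which.max(feasible$accuracy), ]
```

In this toy version, the large-data candidates fall out of the feasible set, so the accuracy-maximizing point shifts to a smaller, cheaper configuration, mirroring the behavior reported below.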

## [1] "There are  3062 candidates."

Notice the change in scales. We’re no longer able to reach a point with more than 4000 training examples. Instead, we’re limited to less data, and lower accuracy, because of the time/money constraint.

## [1] "Accuracy is maximized at 110 training examples and 5 predictors."

## [1] "Expected accuracy in this scenario is:  0.700174303687692"
## [1] "Fixing training examples at:  110"
## [1] "Fixing covariates at:  5"
## [1] "287 candidates fit within these thresholds."

Finally, come back around, and find the computing hardware that’s best for these inputs.

## [1] "Now there are only:  287 candidates, instead of  287 candidates that can complete this scenario under budget."

## [1] "Recommended # cores:  2"
## [1] "Recommended Memory:  4"
## [1] "Expected Cost:  0.277683052076304"
## [1] "Expected Seconds:  2.77128794487329"
## [1] "Expected accuracy:  0.700174303687692"

Data-Constrained Optimization

In this case, we’ve got a constraint on the amount of data available to us.
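The data-constrained variant only changes which dimension gets subset. A sketch on a toy grid (the accuracy formula is an invented stand-in for the BART surface; the 2500-example threshold comes from the scenario):

```r
## Data-constrained sketch: cap the grid at the data threshold, then take
## the accuracy-maximizing point in the remaining upper-right corner.
threshold.numTex <- 2500
candidates <- expand.grid(trainingExamples = c(100, 500, 2000, 10000),
                          covariates = 2:5)
candidates$accuracy <- with(candidates,
                            0.6 + 0.02 * log10(trainingExamples) * covariates / 5)
feasible <- candidates[candidates$trainingExamples <= threshold.numTex, ]
best <- feasible[which.max(feasible$accuracy), ]
```

Hardware is then optimized for these fixed inputs exactly as in the unconstrained case.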

## [1] "Current data threshold is  2500"
## [1] "Accuracy is maximized at 2000 training examples and 5 predictors."
## [1] "Expected Max Accuracy is  0.795928229235456"
## [1] "Now there are only:  287 candidates, instead of  287 candidates that can complete this scenario under budget."

## [1] "Recommended # cores:  5"
## [1] "Recommended Memory:  4"
## [1] "Expected Cost:  1.28306472326009"
## [1] "Expected Seconds:  5.12201486331373"
## [1] "Expected Accuracy:  0.795928229235456"